28 research outputs found

    Exploring early and late ALUs for single-issue in-order pipelines

    Get PDF
    In-order processors are key components in energy-efficient embedded systems. One important design aspect of in-order pipelines is the sequence of pipeline stages: First, the position of the execute stage, in which arithmetic logic unit (ALU) operations and branch prediction are handled, impacts the number of stall cycles caused by data dependencies between data memory instructions and their consuming instructions, and by address-generation instructions that depend on an ALU result. Second, the position of the ALU inside the pipeline impacts the branch penalty. This paper considers the question of how to best make use of ALU resources inside a single-issue in-order pipeline. We begin by analyzing the most efficient way of placing a single ALU in an in-order pipeline. We then evaluate the most efficient way to make use of two ALUs, one early and one late, a technique that has revitalized commercial in-order processors in recent years. Our architectural simulations, which are based on 20 MiBench and 7 SPEC2000 integer benchmarks and a 65-nm post-layout netlist of a complete pipeline, show that utilizing two ALUs in different stages of the pipeline gives better performance and energy efficiency than any pipeline configuration with a single ALU.
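The trade-off the paper studies can be illustrated with a toy cycles-per-instruction model (a hypothetical sketch, not the paper's simulator). All stage positions and event frequencies below are illustrative assumptions: an early ALU makes load consumers stall until the memory stage, a late ALU makes address generation stall and deepens the branch penalty, and a pair of ALUs lets each dependence take the stall-free path.

```python
# Toy CPI model for ALU placement in a single-issue in-order pipeline.
# Stage positions and event frequencies are illustrative assumptions,
# not figures from the paper.

def single_alu_cpi(alu_stage, mem_stage=4,
                   f_load_use=0.10,     # loads whose result feeds an ALU op
                   f_agen_dep=0.05,     # ALU results feeding address generation
                   f_mispredict=0.05):  # mispredicted branches
    """Estimated CPI with a single ALU placed at stage `alu_stage`."""
    # Early ALU: consumers of a load stall until the memory stage.
    load_use_stalls = max(0, mem_stage - alu_stage) * f_load_use
    # Late ALU: address generation stalls waiting for the ALU result.
    agen_stalls = max(0, alu_stage - mem_stage) * f_agen_dep
    # Branches resolve in the ALU stage; a deeper ALU costs more flushes.
    branch_penalty = (alu_stage - 1) * f_mispredict
    return 1.0 + load_use_stalls + agen_stalls + branch_penalty

def dual_alu_cpi(early_stage=2, f_mispredict=0.05):
    """Early and late ALUs: each dependence takes the stall-free path,
    and branches still resolve in the early ALU."""
    return 1.0 + (early_stage - 1) * f_mispredict
```

Under these toy parameters both single-ALU placements cost about 1.25 cycles per instruction, while the dual-ALU configuration costs 1.05, which mirrors the paper's conclusion qualitatively.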

    Design and Implementation of a Time Predictable Processor: Evaluation With a Space Case Study

    Get PDF
    Embedded real-time systems, such as those found in automotive, rail, and aerospace, steadily require higher levels of guaranteed computing performance (and hence time predictability), motivated by the increasing number of functionalities provided by software. However, high-performance processor design is driven by the average-performance needs of the mainstream market. To make things worse, changing those designs is hard, since the embedded real-time market is comparatively small. A path to address this mismatch is designing low-complexity hardware features that favor time predictability and can be enabled or disabled so as not to affect average performance when performance guarantees are not required. In this line, we present the lessons learned designing and implementing LEOPARD, a four-core processor facilitating measurement-based timing analysis (widely used in most domains). LEOPARD has been designed by adding low-overhead hardware mechanisms to a LEON3 processor baseline that allow capturing the impact of jittery resources (i.e., those with variable latency) in the measurements performed at analysis time. In particular, at the core level we handle the jitter of caches, TLBs, and variable-latency floating-point units; at the chip level, we deal with contention so that time-composable timing guarantees can be obtained. The results of our applied study with a Space application show how per-resource jitter is controlled, facilitating the computation of high-quality WCET estimates.
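As a hedged illustration of the measurement-based timing analysis (MBTA) practice the abstract refers to (not LEOPARD's actual method), a WCET estimate is commonly derived from the high-water mark of observed execution times plus an engineering margin; the 20% margin below is an assumption, not a value from the paper.

```python
def mbta_wcet_estimate(measurements, margin=0.20):
    """High-water-mark WCET estimate from observed execution times.

    The 20% engineering margin is an illustrative assumption; industrial
    practice chooses it per domain. Controlling per-resource jitter (as
    LEOPARD does for caches, TLBs, and floating-point units) narrows the
    spread of the measurements and makes the high-water mark a sounder
    basis for the estimate.
    """
    if not measurements:
        raise ValueError("need at least one measurement")
    return max(measurements) * (1.0 + margin)

# e.g. observed cycle counts from repeated runs of a task
estimate = mbta_wcet_estimate([980, 1020, 1015, 990])
```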

    Towards a performance- and energy-efficient data filter cache

    Full text link
    As CPU data requests to the level-one (L1) data cache (DC) can represent as much as 25% of an embedded processor's total power dissipation, techniques that decrease L1 DC accesses can significantly enhance processor energy efficiency. Filter caches are known to efficiently decrease the number of accesses to instruction caches. However, due to the irregular access pattern of data accesses, a conventional data filter cache (DFC) has a high miss rate, which degrades processor performance. We propose to integrate a DFC with a fast address calculation technique to significantly reduce the impact of misses and to improve performance by enabling one-cycle loads. Furthermore, we show that DFC stalls can be eliminated even after unsuccessful fast address calculations, by simultaneously accessing the DFC and the L1 DC on the following cycle. We quantitatively evaluate different DFC configurations, with and without the fast address calculation technique, using different write allocation policies, and qualitatively describe their impact on energy efficiency. The proposed design provides an efficient DFC that yields both energy and performance improvements.
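The filtering idea can be sketched as follows; this is a minimal, hypothetical model, not the proposed design (it omits the fast address calculation, write allocation policies, and stall handling), with sizes and the access stream as illustrative assumptions.

```python
# Minimal sketch of a data filter cache (DFC): a tiny direct-mapped
# structure checked before the larger, more energy-hungry L1 DC.
# Sizes and the access stream are illustrative assumptions.

LINE_BYTES = 16
DFC_LINES = 8          # a 128-byte filter cache

class DataFilterCache:
    def __init__(self):
        self.tags = [None] * DFC_LINES
        self.hits = 0          # loads served by the DFC
        self.l1_accesses = 0   # energy proxy: loads reaching the L1 DC

    def load(self, addr):
        line = addr // LINE_BYTES
        idx = line % DFC_LINES
        if self.tags[idx] == line:
            self.hits += 1          # served by the small structure
        else:
            self.l1_accesses += 1   # miss: line fetched from the L1 DC
            self.tags[idx] = line

dfc = DataFilterCache()
for addr in range(0, 256, 4):   # a regular, array-walk access pattern
    dfc.load(addr)
```

On this regular stream, 48 of the 64 loads are absorbed by the tiny structure. The irregular data access patterns mentioned in the abstract are what drive the miss rate up, motivating the paper's fast address calculation and the parallel DFC/L1 DC access on the following cycle.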

    Speculative tag access for reduced energy dissipation in set-associative L1 data caches

    Full text link
    For performance reasons, all ways in set-associative level-one (L1) data caches are accessed in parallel for load operations, even though the requested data can reside in only one of the ways. Thus, a significant amount of energy is wasted when loads are performed. We propose a speculation technique that performs the tag comparison in parallel with the address calculation, leading to the access of only one way during the following cycle on successful speculations. The technique incurs no execution time penalty, has an insignificant area overhead, and does not require any customized SRAM implementation. Assuming a 16 kB 4-way set-associative L1 data cache implemented in a 65-nm process technology, our evaluation based on 20 different MiBench benchmarks shows that the proposed technique leads to a 24% data cache energy reduction on average.
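A sketch of the speculation follows. It is a hypothetical model using the common assumption that the tag lookup speculates with the base register value and succeeds when the base-plus-offset address stays within the same cache line; the geometry and energy proxy (number of data ways read) are illustrative.

```python
# Energy proxy for speculative tag access in a set-associative L1 DC:
# the tag compare runs in parallel with the base+offset addition, using
# the base register value as a speculative address. If the offset does
# not move the access to another line, the speculation succeeds and only
# the matching data way is read the next cycle. Geometry is illustrative.

WAYS = 4
LINE_BYTES = 32

def data_ways_read(base, offset):
    """How many data ways must be read for one load."""
    if base // LINE_BYTES == (base + offset) // LINE_BYTES:
        return 1      # successful speculation: read only the hit way
    return WAYS       # fallback: conventional parallel read of all ways
```

With the small offsets typical of field and stack accesses, the speculation mostly succeeds (e.g. data_ways_read(0x1000, 8) returns 1), so most loads read one data way instead of four.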

    EPC Enacted: Integration in an Industrial Toolbox and Use against a Railway Application

    Get PDF
    Measurement-based timing analysis approaches are increasingly making their way into several industrial domains on account of their good cost-benefit ratio. The trustworthiness of those methods, however, suffers from the limitation that their results are only valid for the particular paths and execution conditions that the user is able to explore with the available input vectors. It is generally not possible to guarantee that the collected measurements are fully representative of the worst-case timing behaviour. In the context of measurement-based probabilistic timing analysis, the Extended Path Coverage (EPC) approach has recently been proposed as a means to extend the representativeness of measurement observations, to obtain the same effect as full path coverage. At the time of its first publication, EPC had not reached an implementation maturity that could be trialled industrially. In this work we analyze the practical implications of using EPC with real-world applications, and discuss the challenges in integrating it in an industrial-quality toolchain. We show that we were able to meet EPC requirements and successfully evaluate the technique on a real railway application, on top of a commercial toolchain and full execution stack. This work has received funding from the European Community's Seventh Framework Programme [FP7/2007-2013] under grant agreement 611085 (PROXIMA, www.proxima-project.eu). This work has also been partially supported by the Spanish Ministry of Economy and Competitiveness (MINECO) under grant TIN2015-65316-P and the HiPEAC Network of Excellence. Jaume Abella has been partially supported by the MINECO under Ramon y Cajal postdoctoral fellowship number RYC-2013-14717. The authors are grateful to Antoine Colin from Rapita Ltd. for his valuable support.

    Probabilistic timing analysis on time-randomized platforms for the space domain

    Get PDF
    Timing verification is a fundamental step in real-time embedded systems, with measurement-based timing analysis (MBTA) being the most common approach used to that end. We present a Space case study on a real platform that has been modified to support a probabilistic variant of MBTA called MBPTA. Our platform provides the properties required by MBPTA, with the WCET estimates predicted by MBPTA being competitive with those of current MBTA practice while providing more solid evidence of their correctness for certification. The research leading to these results has received funding from the European Community's FP7 [FP7/2007-2013] under the PROXIMA Project (www.proxima-project.eu), grant agreement no. 611085. This work has also been partially supported by the Spanish Ministry of Science and Innovation under grant TIN2015-65316-P and the HiPEAC Network of Excellence. Jaume Abella has been partially supported by the Ministry of Economy and Competitiveness under Ramon y Cajal postdoctoral fellowship number RYC-2013-14717. Carles Hernandez is jointly funded by the Spanish Ministry of Economy and Competitiveness and FEDER funds through grant TIN2014-60404-JIN.

    Data Access Techniques for Enhanced Energy Efficiency and Performance in In-order Pipelines

    No full text
    Energy efficiency is one of the key metrics in the design of a wide range of processor types. For example, battery-powered devices, which are growing in number every day, require energy-efficient processors to be able to operate for a useful period of time. Techniques that improve the energy efficiency of a processor can alleviate problems like heat generation to a certain degree, which in turn can allow better performance. In addition, energy efficiency reduces the operating costs of high-performance computing systems, which is very desirable. Level-1 data caches (L1 DCs) dissipate a significant portion of the pipeline energy in general-purpose processors. For example, an L1 DC can dissipate up to 23% of the pipeline energy in a 7-stage single-issue in-order pipeline. In this thesis, a number of techniques are introduced to reduce the energy dissipation of L1 DCs. The focus is on reducing the L1 DC energy without reducing performance, since L1 DC accesses affect the performance of the processor. Some of the techniques introduced in this thesis can even improve the performance of the processor slightly. In addition, ease of implementation is one of the important considerations in this thesis, in that the energy-saving techniques should be implementable with common semi-custom design flows. Some of the proposed techniques reduce the energy dissipation of the data translation lookaside buffer (DTLB), which is closely coupled with the L1 DC. Two of the papers included in this thesis, namely Speculative Tag Access (STA) and Early Load Data Dependence Detection (ELD^3), are very simple to implement and reduce the L1 DC access energy. Another two papers included in the thesis are about filter caches, with the main focus on the Data Filter Cache (DFC). The first paper tackles the implementation issues related to previously proposed data filter caches and proposes novel ways to utilize the DFC in the pipeline to reduce the energy dissipation of both the L1 DC and the DTLB while improving performance at the same time. The second paper utilizes filter caches for wide-voltage-range processors in order to tackle the scalability problems of SRAMs used in level-1 caches. A paper about a hardware/software co-design technique is included to evaluate the potential of software control over the energy efficiency of L1 DC accesses. In the final paper included in the thesis, a 7-stage pipeline is evaluated in detail in terms of the execute stage and the L1 DC access stage, which affect performance directly due to data dependencies.

    Techniques to Reduce Energy Dissipation in Level-1 Data Caches

    No full text
    The number of battery-powered devices is growing significantly, and these devices require energy-efficient hardware to operate for a useful period of time. Also, the future of performance scaling in processors depends on energy efficiency, as the increasing number of cores on a single chip does not leave room for inefficient microarchitectures. Thus, energy efficiency has become one of the most important goals in the design of a wide range of processor types. The energy dissipation of data caches represents a significant portion of the total energy for general-purpose processors. In this thesis, three new techniques are introduced to reduce the energy dissipation of level-one data caches (L1 DCs). In addition, some of the presented work also addresses the energy dissipation of the data translation lookaside buffer (DTLB), which is closely related to the L1 DC. Since data caches affect processor performance, energy-saving techniques must be considered in terms of their impact on performance. Some of the proposed techniques allow the L1 DC to be accessed early in the pipeline, improving processor performance. Two of the presented techniques, the tagless access buffer (TAB) and the data filter cache (DFC), reduce the energy dissipation of data caches and improve overall performance by diverting part of the data accesses to a very small and energy-efficient cache or buffer structure. The TAB uses hardware/software co-design to achieve this goal, while the DFC is entirely based on hardware. Although the software control in the TAB enables an efficient hardware implementation and fewer redundant line fetches from the L1 DC and the higher levels of the hierarchy, it requires modifications to the instruction set architecture, which can be impractical due to binary incompatibility. The DFC can achieve very significant energy gains by means of hardware control only. The third technique presented in this thesis, speculative tag access, improves the efficiency of the L1 DC by performing the tag match operation early in the pipeline in a speculative way. In this manner, only one way of the data is accessed on a speculation success. Compared to the other two techniques, the complexity of control and the area overhead are very low, yet the energy reduction is significant.
